NSF PAR Search | NSF Public Access Repository

Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection

Ganjdanesh, Alireza; Kang, Yan; Liu, Yuchen; Zhang, Richard; Lin, Zhe; Huang, Heng (August 2024, European Conference on Computer Vision (ECCV 2024))

Full Text Available
Pirate: No Compromise Low-Bandwidth VR Streaming for Edge Devices

https://doi.org/10.1145/3676641.3716268

Zhang, Yingtian; Kang, Yan; Ying, Ziyu; Lu, Wanhang; Lan, Sijie; Xu, Huijuan; Maeng, Kiwan; Sivasubramaniam, Anand; Kandemir, Mahmut T; Das, Chita R (March 2025, ACM)

Full Text Available
SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

https://doi.org/10.1109/CVPR52733.2024.00827

Li, Zhengang; Kang, Yan; Liu, Yuchen; Liu, Difan; Hinz, Tobias; Liu, Feng; Wang, Yanzhi (June 2024, IEEE)

Full Text Available
Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

https://doi.org/10.1109/CVPR52733.2024.01522

Wang, Hongjie; Liu, Difan; Kang, Yan; Li, Yijun; Lin, Zhe; Jha, Niraj K; Liu, Yuchen (June 2024, IEEE)

Full Text Available
Studying CPU and memory utilization of applications on Fujitsu A64FX and Nvidia Grace Superchip

https://doi.org/10.1145/3695794.3695813

Kang, Yan; Ghosh, Sayan; Kandemir, Mahmut; Márquez, Andrés (September 2024, ACM)

Full Text Available
EdgePC: Efficient Deep Learning Analytics for Point Clouds on Edge Devices

https://doi.org/10.1145/3579371.3589113

Ying, Ziyu; Bhuyan, Sandeepa; Kang, Yan; Zhang, Yingtian; Kandemir, Mahmut T.; Das, Chita R. (January 2023, International Symposium on Computer Architecture 2023)

Point clouds (PCs) have recently gained popularity for modeling 3D objects, both synthetic and real-life, and are used extensively in applications such as AR/VR, 3D reconstruction, and autonomous driving. For such applications, it is critical to analyze and understand the surrounding scenes properly. To achieve this, deep-learning-based methods (e.g., convolutional neural networks (CNNs)) have been widely employed for higher accuracy. Unlike deep learning on conventional 2D images and videos, where feature computation (matrix multiplication) is the major bottleneck, in point-cloud-based CNNs the sample and neighbor search stages are the primary bottlenecks, collectively contributing 54% (up to 80%) of the overall execution latency on a typical edge device. While prior efforts have attempted to solve this issue by designing custom ASICs or pipelining the neighbor search with other stages, to our knowledge, none of them has tried to “structurize” the unstructured PC data to improve computational efficiency. In this paper, we first explore the opportunities of structurizing PC data using Morton code (originally designed to map data from a high-dimensional space to one dimension while preserving spatial locality) and observe that there is considerable scope to “skip” the sample and neighbor search computation by operating on the “structurized” PC data. Based on this, we propose two approximation techniques for the sampling and neighbor search stages. We implemented our proposals on an NVIDIA Jetson AGX Xavier edge GPU board. The evaluation results collected on six different workloads show that our design can accelerate the sample and neighbor search stages by 3.68× (up to 5.21×) with minimal impact on inference accuracy. This acceleration in turn results in a 1.55× speedup in end-to-end execution latency and saves 33% of energy expenditure.
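The abstract describes Morton code as a mapping from a high-dimensional space to one dimension that preserves spatial locality. The snippet below is a minimal illustrative sketch of that idea only, not the EdgePC authors' implementation. The function names (`morton3d`, `quantize`, `sort_by_morton`), the 10-bit grid resolution, and the assumption that coordinates are normalized to [0, 1] are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)        # x bit goes to position 3i
        code |= ((y >> i) & 1) << (3 * i + 1)    # y bit goes to position 3i + 1
        code |= ((z >> i) & 1) << (3 * i + 2)    # z bit goes to position 3i + 2
    return code


def quantize(p: Point, bits: int = 10) -> Tuple[int, int, int]:
    """Map a point with coordinates assumed to lie in [0, 1] onto an integer grid."""
    scale = (1 << bits) - 1
    qx, qy, qz = (min(scale, max(0, int(c * scale))) for c in p)
    return qx, qy, qz


def sort_by_morton(points: List[Point], bits: int = 10) -> List[Point]:
    """Order points along the Z-order curve so spatial neighbors tend to be adjacent."""
    return sorted(points, key=lambda p: morton3d(*quantize(p, bits), bits=bits))


if __name__ == "__main__":
    cloud = [(0.9, 0.9, 0.9), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
    for p in sort_by_morton(cloud):
        print(p, morton3d(*quantize(p)))
```

Once the points are sorted this way, points that are close in 3D tend to sit near each other in the 1D array, which is the property the abstract relies on when it speaks of operating on "structurized" data. How the paper's two approximation techniques use that ordering is not specified in the abstract and is not reproduced here.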
Full Text Available
Comparing the behavior of OpenMP Implementations with various Applications on two different Fujitsu A64FX platforms

https://doi.org/10.1145/3437359.3465592

Michalowicz, Benjamin; Raut, Eric; Kang, Yan; Curtis, Tony; Chapman, Barbara; Oryspayev, Dossay (July 2021, PEARC '21: Practice and Experience in Advanced Research Computing)
Fujitsu's development of the A64FX processor has been a major innovation in vectorized processors and led to Fugaku, currently the world’s fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications built with different compilers, and to examine how these applications scale on the A64FX processors in clusters at Stony Brook University and RIKEN.
Full Text Available
SEIMI: Efficient and Secure SMAP-Enabled Intra-process Memory Isolation

https://doi.org/10.1109/SP40000.2020.00087

Wang, Zhe; Wu, Chenggang; Xie, Mengyao; Zhang, Yinqian; Lu, Kangjie; Zhang, Xiaofeng; Lai, Yuanming; Kang, Yan; Yang, Min (May 2020, the 41st IEEE Symposium on Security and Privacy (Oakland'20))
Full Text Available
Ookami: Deployment and Initial Experiences

https://doi.org/10.1145/3437359.3465578

Burford, Andrew; Calder, Alan; Carlson, David; Chapman, Barbara; Coskun, Firat; Curtis, Tony; Feldman, Catherine; Harrison, Robert; Kang, Yan; Michalowicz, Benjamin; et al (July 2021, PEARC '21: Practice and Experience in Advanced Research Computing)
Ookami [3] is a computer technology testbed supported by the United States National Science Foundation. It provides researchers with access to the A64FX processor developed by Fujitsu [17] in collaboration with RIKEN [35, 37] for the Japanese path to exascale computing, as deployed in Fugaku [36], the fastest computer in the world [34]. By focusing on crucial architectural details, the ARM-based, multi-core, 512-bit SIMD-vector processor with ultrahigh-bandwidth memory promises to retain familiar and successful programming models while achieving very high performance for a wide range of applications. We review relevant technology and system details, and the main body of the paper focuses on initial experiences with the hardware and software ecosystem for micro-benchmarks, mini-apps, and full applications, and starts to answer questions about where such technologies fit into the NSF ecosystem.
Full Text Available
